Haven Package
install.packages("haven")
Import SAS data with haven
haven is an extremely easy-to-use package to import data from three software packages: SAS, STATA and SPSS. Depending on the software, you use different functions:
SAS: read_sas()
STATA: read_dta() (or read_stata(); the two are identical)
SPSS: read_sav() or read_por(), depending on the file type.
All these functions take one key argument: the path to your local file. In fact, you can even pass a URL; haven will then automatically download the file for you before importing it.
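To try the readers without any SAS, STATA or SPSS files at hand, one option is a round trip through temporary files. The sketch below assumes haven's writer functions write_dta() and write_sav(); the data frame toy and its columns are invented for illustration and are not one of the datasets used in these notes.

```r
library(haven)

# Hypothetical toy data, invented for this sketch
toy <- data.frame(id = c(1, 2, 3), score = c(2.5, 3.0, 4.5))

# Write the toy data to temporary STATA and SPSS files
dta_path <- tempfile(fileext = ".dta")
sav_path <- tempfile(fileext = ".sav")
write_dta(toy, dta_path)
write_sav(toy, sav_path)

# Each reader takes the file path as its first argument
from_stata <- read_dta(dta_path)
from_spss  <- read_sav(sav_path)

# Both imports hold the original values
from_stata$score
from_spss$score
```

The same calls work with the path of a real file in place of the temporary paths.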
You’ll be working with data on the age, gender, income, and purchase level (0 = low, 1 = high) of 36 individuals (Source: SAS). The information is stored in a SAS file, sales.sas7bdat, which is available in your current working directory. You can also download the data here.
# Load the haven package
library(haven)
# Import sales.sas7bdat: sales
sales <- read_sas("../xDatasets/sales.sas7bdat")
# Display the structure of sales
str(sales)
## Classes 'tbl_df', 'tbl' and 'data.frame': 431 obs. of 4 variables:
## $ purchase: num 0 0 1 1 0 0 0 0 0 0 ...
## $ age : num 41 47 41 39 32 32 33 45 43 40 ...
## $ gender : chr "Female" "Female" "Female" "Female" ...
## $ income : chr "Low" "Low" "Low" "Low" ...
## - attr(*, "label")= chr "SALES"
Import STATA data with haven
Next up are STATA data files; you can use read_dta() for these.
When inspecting the result of the read_dta() call, you will notice that one column will be imported as a labelled vector, an R equivalent for the common data structure in other statistical environments. In order to effectively continue working on the data in R, it’s best to change this data into a standard R class. To convert a variable of the class labelled to a factor, you’ll need haven’s as_factor() function.
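As a small sketch of that conversion, the vector below is built by hand with haven's labelled() constructor instead of being read from a file. The codes and date labels are invented to mimic a column whose value labels are dates.

```r
library(haven)

# Hypothetical labelled vector: numeric codes with dates as value labels
x <- labelled(c(2, 1, 3),
              labels = c("2004-12-31" = 1, "2005-12-31" = 2, "2006-12-31" = 3))

# as_factor() replaces each code with its label
f <- as_factor(x)

# A factor of date strings can then be converted to the Date class
d <- as.Date(f)
class(d)
format(d)
```

Going through as_factor() first matters here: calling as.Date() on the raw codes would treat them as day counts instead of looking up the labels.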
In this exercise, you will work with data on yearly import and export numbers of sugar, both in USD and in weight. The data can be found at: http://assets.datacamp.com/production/course_1478/datasets/trade.dta
# Import the data from the URL: sugar
sugar <- read_stata("http://assets.datacamp.com/production/course_1478/datasets/trade.dta")
# Structure of sugar
str(sugar)
## Classes 'tbl_df', 'tbl' and 'data.frame': 10 obs. of 5 variables:
## $ Date : 'haven_labelled' num 10 9 8 7 6 5 4 3 2 1
## ..- attr(*, "label")= chr "Date"
## ..- attr(*, "format.stata")= chr "%9.0g"
## ..- attr(*, "labels")= Named num 1 2 3 4 5 6 7 8 9 10
## .. ..- attr(*, "names")= chr "2004-12-31" "2005-12-31" "2006-12-31" "2007-12-31" ...
## $ Import : num 37664782 16316512 11082246 35677943 9879878 ...
## ..- attr(*, "label")= chr "Import"
## ..- attr(*, "format.stata")= chr "%9.0g"
## $ Weight_I: num 54029106 21584365 14526089 55034932 14806865 ...
## ..- attr(*, "label")= chr "Weight_I"
## ..- attr(*, "format.stata")= chr "%9.0g"
## $ Export : num 5.45e+07 1.03e+08 3.79e+07 4.85e+07 7.15e+07 ...
## ..- attr(*, "label")= chr "Export"
## ..- attr(*, "format.stata")= chr "%9.0g"
## $ Weight_E: num 9.34e+07 1.58e+08 8.80e+07 1.12e+08 1.32e+08 ...
## ..- attr(*, "label")= chr "Weight_E"
## ..- attr(*, "format.stata")= chr "%9.0g"
## - attr(*, "label")= chr "Written by R."
# Convert values in Date column to dates
sugar$Date <- as.Date(as_factor(sugar$Date))
# Print sugar as a styled table
# kable() is from knitr; kable_styling(), row_spec() and %>% come with kableExtra
library(knitr)
library(kableExtra)
sugar %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE, position = "left", font_size = 11) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#3f7689")
| Date | Import | Weight_I | Export | Weight_E |
|---|---|---|---|---|
| 2013-12-31 | 37664782 | 54029106 | 54505513 | 93350013 |
| 2012-12-31 | 16316512 | 21584365 | 102700010 | 158000010 |
| 2011-12-31 | 11082246 | 14526089 | 37935000 | 88000000 |
| 2010-12-31 | 35677943 | 55034932 | 48515008 | 112000005 |
| 2009-12-31 | 9879878 | 14806865 | 71486545 | 131800000 |
| 2008-12-31 | 1539992 | 1749318 | 12311696 | 18500014 |
| 2007-12-31 | 28021 | 54567 | 16489813 | 39599944 |
| 2006-12-31 | 2652 | 3821 | 29273920 | 102072480 |
| 2005-12-31 | 7067402 | 23722957 | 46497438 | 147583380 |
| 2004-12-31 | 1033672 | 1964980 | 27131638 | 78268792 |
What does the graphic tell?
A plot can be very useful to explore the relationship between two variables. If you pass the plot() function two arguments, the first one will be plotted on the x-axis, the second one will be plotted on the y-axis.
The sugar trading data is again available at http://assets.datacamp.com/production/course_1478/datasets/trade.dta.
After you’ve imported the data frame, you should plot two of its variables, Import against Weight_I, and describe their relationship! haven is already loaded in your R session, so you can start importing straight away.
trade <- read_stata("http://assets.datacamp.com/production/course_1478/datasets/trade.dta")
plot(trade$Import, trade$Weight_I)
The import figures in USD and the import figures in weight are positively correlated.
You can spot an increasing trend among the data points. This of course makes sense: the more sugar is traded, the higher the weight that’s traded.
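The visual impression can be backed up with a number. The sketch below retypes the Import and Weight_I columns from the sugar table printed earlier, so it runs without downloading the file, and uses base R's plot() and cor(). The axis labels are my own choice.

```r
# Import (USD) and Weight_I values copied from the sugar table above
import   <- c(37664782, 16316512, 11082246, 35677943, 9879878,
              1539992, 28021, 2652, 7067402, 1033672)
weight_i <- c(54029106, 21584365, 14526089, 55034932, 14806865,
              1749318, 54567, 3821, 23722957, 1964980)

# First argument goes on the x-axis, second on the y-axis
plot(import, weight_i, xlab = "Import (USD)", ylab = "Weight_I")

# Pearson correlation; a value near 1 supports the upward trend
r <- cor(import, weight_i)
r
```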
Import SPSS data with haven
The haven package can also import data files from SPSS. Again, importing the data is pretty straightforward. Depending on the SPSS data file you’re working with, you’ll need either read_sav() - for .sav files - or read_por() - for .por files.
In this exercise, you will work with data on four of the Big Five personality traits for 434 persons (Source: University of Bath). The Big Five is a psychological model that classifies human personality along five dimensions. The SPSS dataset is called person.sav and is available in your working directory.
# Import person.sav: traits
traits <- read_sav("../xDatasets/person.sav")
# Summarize traits
do.call(cbind, lapply(traits, summary))
## Neurotic Extroversion Agreeableness Conscientiousness
## Min. 0.00000 5.00000 15.0000 7.00000
## 1st Qu. 18.00000 26.00000 39.0000 25.00000
## Median 24.00000 31.00000 45.0000 30.00000
## Mean 23.62619 30.22967 44.5494 30.85238
## 3rd Qu. 29.00000 34.00000 50.0000 36.00000
## Max. 44.00000 65.00000 73.0000 58.00000
## NA's 14.00000 16.00000 19.0000 14.00000
# Print out a subset
subset(traits, Extroversion > 40 & Agreeableness > 40) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE, position = "left", font_size = 11) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#3f7689")
| Neurotic | Extroversion | Agreeableness | Conscientiousness |
|---|---|---|---|
| 38 | 43 | 49 | 29 |
| 20 | 42 | 46 | 31 |
| 18 | 42 | 49 | 31 |
| 42 | 43 | 44 | 29 |
| 30 | 42 | 51 | 24 |
| 18 | 42 | 50 | 25 |
| 27 | 45 | 55 | 23 |
| 18 | 43 | 57 | 34 |
Factorize, round two
In the last exercise you learned how to import a data file using the command read_sav(). With SPSS data files, it can also happen that some of the variables you import have the labelled class. This is done to keep all the labelling information that was originally present in the .sav and .por files. It’s advised to coerce (or change) these variables to factors or other standard R classes.
The data for this exercise involves information on employees and their demographic and economic attributes (Source: QRiE). The data can be found on the following URL:
http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/employee.sav
# Import SPSS data from the URL: work
work <- read_sav("http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/employee.sav")
# Display summary of work$GENDER
summary(work$GENDER)
## Length Class Mode
## 474 haven_labelled character
# Convert work$GENDER to a factor
work$GENDER <- as_factor(work$GENDER)
# Display summary of work$GENDER again
summary(work$GENDER)
## Female Male
## 216 258
The as_factor() function has more arguments you can specify; have a look at its documentation to discover more!
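Two of those arguments are levels and ordered. The sketch below shows them on a hand-built labelled vector; the codes "f" and "m" are assumptions for illustration, since the actual codes stored in employee.sav are not shown in these notes.

```r
library(haven)

# Hypothetical labelled character vector, loosely modelled on a GENDER column
gender <- labelled(c("f", "m", "m", "f"),
                   labels = c(Female = "f", Male = "m"))

# Default: the value labels become the factor levels
default_levels <- levels(as_factor(gender))
default_levels

# levels = "values": keep the underlying codes instead of the labels
by_value <- as_factor(gender, levels = "values")
as.character(by_value)

# ordered = TRUE: return an ordered factor
by_order <- as_factor(gender, ordered = TRUE)
is.ordered(by_order)
```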
foreign package
install.packages("foreign")
Import STATA data with foreign (1)
The foreign package offers a simple function to import and read STATA data: read.dta().
In this exercise, you will import data on the US presidential elections in the year 2000. The data in florida.dta contains the total numbers of votes for each of the four candidates as well as the total number of votes per election area in the state of Florida (Source: Florida Department of State). The file is available in your working directory; you can download it here if you want to experiment some more.
# Load the foreign package
library(foreign)
# Import florida.dta and name the resulting data frame florida
florida <- read.dta("../xDatasets/florida.dta")
# Check tail() of florida
florida %>%
  tail() %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE, position = "left", font_size = 11) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#3f7689")
| | gore | bush | buchanan | nader | total |
|---|---|---|---|---|---|
| 62 | 2647 | 4051 | 27 | 59 | 6784 |
| 63 | 1399 | 2326 | 26 | 29 | 3780 |
| 64 | 97063 | 82214 | 396 | 2436 | 182109 |
| 65 | 3835 | 4511 | 46 | 149 | 8541 |
| 66 | 5637 | 12176 | 120 | 265 | 18198 |
| 67 | 2796 | 4983 | 88 | 93 | 7960 |
Import STATA data with foreign (2)
Data can be very diverse, going from character vectors to categorical variables, dates and more. It’s in these cases that the additional arguments of read.dta() will come in handy.
The arguments you will use most often are convert.dates, convert.factors, missing.type and convert.underscore. Their meaning is pretty straightforward. It’s all about correctly converting STATA data to standard R data structures. Type ?read.dta to find out about the default values.
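The effect of two of these arguments can be seen on a tiny file without the exercise data. The sketch below writes an invented data frame to a temporary STATA file with foreign's write.dta() and reads it back three ways; toy, its columns and tmp_path are hypothetical names. convert.dates and missing.type are not covered here.

```r
library(foreign)

# Hypothetical toy data: a factor plus a column name containing an underscore
toy <- data.frame(
  income_mnt = c(13.3, 142.5, 20.0),
  gender     = factor(c("female", "male", "female"), levels = c("male", "female"))
)

# Write a temporary STATA file; factor levels are stored as value labels
tmp_path <- tempfile(fileext = ".dta")
write.dta(toy, tmp_path)

d1 <- read.dta(tmp_path)                             # defaults
d2 <- read.dta(tmp_path, convert.factors = FALSE)    # keep the numeric codes
d3 <- read.dta(tmp_path, convert.underscore = TRUE)  # "_" becomes "." in names

class(d1$gender)
d2$gender
names(d3)
```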
The dataset for this exercise contains socio-economic measures and access to education for different individuals (Source: World Bank). This data is available as edequality.dta, which is located in the worldbank folder in your working directory.
# Specify the file path using file.path(): path
path <- file.path("../xDatasets","edequality.dta")
# Create and print structure of edu_equal_1
edu_equal_1 <- read.dta(path)
str(edu_equal_1)
## 'data.frame': 12214 obs. of 27 variables:
## $ hhid : num 1 1 1 2 2 3 4 4 5 6 ...
## $ hhweight : num 627 627 627 627 627 ...
## $ location : Factor w/ 2 levels "urban location",..: 1 1 1 1 1 2 2 2 1 1 ...
## $ region : Factor w/ 9 levels "Sofia city","Bourgass",..: 8 8 8 9 9 4 4 4 8 8 ...
## $ ethnicity_head : Factor w/ 4 levels "Bulgaria","Turks",..: 2 2 2 1 1 1 1 1 1 1 ...
## $ age : num 37 11 8 73 70 75 79 80 82 83 ...
## $ gender : Factor w/ 2 levels "male","female": 2 2 1 1 2 1 1 2 2 2 ...
## $ relation : Factor w/ 9 levels "head ",..: 1 3 3 1 2 1 1 2 1 1 ...
## $ literate : Factor w/ 2 levels "no","yes": 1 2 2 2 2 2 2 2 2 2 ...
## $ income_mnt : num 13.3 13.3 13.3 142.5 142.5 ...
## $ income : num 160 160 160 1710 1710 ...
## $ aggregate : num 1042 1042 1042 3271 3271 ...
## $ aggr_ind_annual : num 347 347 347 1635 1635 ...
## $ educ_completed : int 2 4 4 4 3 3 3 3 4 4 ...
## $ grade_complete : num 4 3 0 3 4 4 4 4 5 5 ...
## $ grade_all : num 4 11 8 11 8 8 8 8 13 13 ...
## $ unemployed : int 2 1 1 1 1 1 1 1 1 1 ...
## $ reason_OLF : int NA NA NA 3 3 3 9 9 3 3 ...
## $ sector : int NA NA NA NA NA NA 1 1 NA NA ...
## $ occupation : int NA NA NA NA NA NA 5 5 NA NA ...
## $ earn_mont : num 0 0 0 0 0 0 20 20 0 0 ...
## $ earn_ann : num 0 0 0 0 0 0 240 240 0 0 ...
## $ hours_week : num NA NA NA NA NA NA 30 35 NA NA ...
## $ hours_mnt : num NA NA NA NA NA ...
## $ fulltime : int NA NA NA NA NA NA 1 1 NA NA ...
## $ hhexp : num 100 100 100 343 343 ...
## $ legacy_pension_amt: num NA NA NA NA NA NA NA NA NA NA ...
## - attr(*, "datalabel")= chr ""
## - attr(*, "time.stamp")= chr ""
## - attr(*, "formats")= chr "%9.0g" "%9.0g" "%9.0g" "%9.0g" ...
## - attr(*, "types")= int 100 100 108 108 108 100 108 108 108 100 ...
## - attr(*, "val.labels")= chr "" "" "location" "region" ...
## - attr(*, "var.labels")= chr "hhid" "hhweight" "location" "region" ...
## - attr(*, "expansion.fields")=List of 12
## ..$ : chr "_dta" "_svy_su1" "cluster"
## ..$ : chr "_dta" "_svy_strata1" "strata"
## ..$ : chr "_dta" "_svy_stages" "1"
## ..$ : chr "_dta" "_svy_version" "2"
## ..$ : chr "_dta" "__XijVarLabcons" "(sum) cons"
## ..$ : chr "_dta" "ReS_Xij" "cons"
## ..$ : chr "_dta" "ReS_str" "0"
## ..$ : chr "_dta" "ReS_j" "group"
## ..$ : chr "_dta" "ReS_ver" "v.2"
## ..$ : chr "_dta" "ReS_i" "hhid dur"
## ..$ : chr "_dta" "note1" "variables g1pc, g2pc, g3pc, g4pc, g5pc, g7pc, g8pc, g9pc, g10pc, g11pc, g12pc, gall, health, rent, durables we"| __truncated__
## ..$ : chr "_dta" "note0" "1"
## - attr(*, "version")= int 7
## - attr(*, "label.table")=List of 12
## ..$ location: Named int 1 2
## .. ..- attr(*, "names")= chr "urban location" "rural location"
## ..$ region : Named int 1 2 3 4 5 6 7 8 9
## .. ..- attr(*, "names")= chr "Sofia city" "Bourgass" "Varna" "Lovetch" ...
## ..$ ethnic : Named int 1 2 3 4
## .. ..- attr(*, "names")= chr "Bulgaria" "Turks" "Roma" "Other"
## ..$ s2_q2 : Named int 1 2
## .. ..- attr(*, "names")= chr "male" "female"
## ..$ s2_q3 : Named int 1 2 3 4 5 6 7 8 9
## .. ..- attr(*, "names")= chr "head " "spouse/partner " "child " "son/daughter-in-law " ...
## ..$ lit : Named int 1 2
## .. ..- attr(*, "names")= chr "no" "yes"
## ..$ : Named int 1 2 3 4
## .. ..- attr(*, "names")= chr "never attanded" "primary" "secondary" "postsecondary"
## ..$ : Named int 1 2
## .. ..- attr(*, "names")= chr "Not unemployed" "Unemployed"
## ..$ : Named int 1 2 3 4 5 6 7 8 9 10
## .. ..- attr(*, "names")= chr "student" "housewife/childcare" "in retirement" "illness, disability" ...
## ..$ : Named int 1 2 3 4 5 6 7 8 9 10
## .. ..- attr(*, "names")= chr "agriculture" "mining" "manufacturing" "utilities" ...
## ..$ : Named int 1 2 3 4 5
## .. ..- attr(*, "names")= chr "private company" "public works program" "government,public sector, army" "private individual" ...
## ..$ : Named int 1 2
## .. ..- attr(*, "names")= chr "no" "yes"
# Create and print structure of edu_equal_2
edu_equal_2 <- read.dta(path,
                        convert.factors = FALSE)
str(edu_equal_2)
## 'data.frame': 12214 obs. of 27 variables:
## $ hhid : num 1 1 1 2 2 3 4 4 5 6 ...
## $ hhweight : num 627 627 627 627 627 ...
## $ location : int 1 1 1 1 1 2 2 2 1 1 ...
## $ region : int 8 8 8 9 9 4 4 4 8 8 ...
## $ ethnicity_head : int 2 2 2 1 1 1 1 1 1 1 ...
## $ age : num 37 11 8 73 70 75 79 80 82 83 ...
## $ gender : int 2 2 1 1 2 1 1 2 2 2 ...
## $ relation : int 1 3 3 1 2 1 1 2 1 1 ...
## $ literate : int 1 2 2 2 2 2 2 2 2 2 ...
## $ income_mnt : num 13.3 13.3 13.3 142.5 142.5 ...
## $ income : num 160 160 160 1710 1710 ...
## $ aggregate : num 1042 1042 1042 3271 3271 ...
## $ aggr_ind_annual : num 347 347 347 1635 1635 ...
## $ educ_completed : int 2 4 4 4 3 3 3 3 4 4 ...
## $ grade_complete : num 4 3 0 3 4 4 4 4 5 5 ...
## $ grade_all : num 4 11 8 11 8 8 8 8 13 13 ...
## $ unemployed : int 2 1 1 1 1 1 1 1 1 1 ...
## $ reason_OLF : int NA NA NA 3 3 3 9 9 3 3 ...
## $ sector : int NA NA NA NA NA NA 1 1 NA NA ...
## $ occupation : int NA NA NA NA NA NA 5 5 NA NA ...
## $ earn_mont : num 0 0 0 0 0 0 20 20 0 0 ...
## $ earn_ann : num 0 0 0 0 0 0 240 240 0 0 ...
## $ hours_week : num NA NA NA NA NA NA 30 35 NA NA ...
## $ hours_mnt : num NA NA NA NA NA ...
## $ fulltime : int NA NA NA NA NA NA 1 1 NA NA ...
## $ hhexp : num 100 100 100 343 343 ...
## $ legacy_pension_amt: num NA NA NA NA NA NA NA NA NA NA ...
## - attr(*, "datalabel")= chr ""
## - attr(*, "time.stamp")= chr ""
## - attr(*, "formats")= chr "%9.0g" "%9.0g" "%9.0g" "%9.0g" ...
## - attr(*, "types")= int 100 100 108 108 108 100 108 108 108 100 ...
## - attr(*, "val.labels")= chr "" "" "location" "region" ...
## - attr(*, "var.labels")= chr "hhid" "hhweight" "location" "region" ...
## - attr(*, "expansion.fields")=List of 12
## ..$ : chr "_dta" "_svy_su1" "cluster"
## ..$ : chr "_dta" "_svy_strata1" "strata"
## ..$ : chr "_dta" "_svy_stages" "1"
## ..$ : chr "_dta" "_svy_version" "2"
## ..$ : chr "_dta" "__XijVarLabcons" "(sum) cons"
## ..$ : chr "_dta" "ReS_Xij" "cons"
## ..$ : chr "_dta" "ReS_str" "0"
## ..$ : chr "_dta" "ReS_j" "group"
## ..$ : chr "_dta" "ReS_ver" "v.2"
## ..$ : chr "_dta" "ReS_i" "hhid dur"
## ..$ : chr "_dta" "note1" "variables g1pc, g2pc, g3pc, g4pc, g5pc, g7pc, g8pc, g9pc, g10pc, g11pc, g12pc, gall, health, rent, durables we"| __truncated__
## ..$ : chr "_dta" "note0" "1"
## - attr(*, "version")= int 7
## - attr(*, "label.table")=List of 12
## ..$ location: Named int 1 2
## .. ..- attr(*, "names")= chr "urban location" "rural location"
## ..$ region : Named int 1 2 3 4 5 6 7 8 9
## .. ..- attr(*, "names")= chr "Sofia city" "Bourgass" "Varna" "Lovetch" ...
## ..$ ethnic : Named int 1 2 3 4
## .. ..- attr(*, "names")= chr "Bulgaria" "Turks" "Roma" "Other"
## ..$ s2_q2 : Named int 1 2
## .. ..- attr(*, "names")= chr "male" "female"
## ..$ s2_q3 : Named int 1 2 3 4 5 6 7 8 9
## .. ..- attr(*, "names")= chr "head " "spouse/partner " "child " "son/daughter-in-law " ...
## ..$ lit : Named int 1 2
## .. ..- attr(*, "names")= chr "no" "yes"
## ..$ : Named int 1 2 3 4
## .. ..- attr(*, "names")= chr "never attanded" "primary" "secondary" "postsecondary"
## ..$ : Named int 1 2
## .. ..- attr(*, "names")= chr "Not unemployed" "Unemployed"
## ..$ : Named int 1 2 3 4 5 6 7 8 9 10
## .. ..- attr(*, "names")= chr "student" "housewife/childcare" "in retirement" "illness, disability" ...
## ..$ : Named int 1 2 3 4 5 6 7 8 9 10
## .. ..- attr(*, "names")= chr "agriculture" "mining" "manufacturing" "utilities" ...
## ..$ : Named int 1 2 3 4 5
## .. ..- attr(*, "names")= chr "private company" "public works program" "government,public sector, army" "private individual" ...
## ..$ : Named int 1 2
## .. ..- attr(*, "names")= chr "no" "yes"
# Create and print structure of edu_equal_3
edu_equal_3 <- read.dta(path,
                        convert.underscore = TRUE)
str(edu_equal_3)
## 'data.frame': 12214 obs. of 27 variables:
## $ hhid : num 1 1 1 2 2 3 4 4 5 6 ...
## $ hhweight : num 627 627 627 627 627 ...
## $ location : Factor w/ 2 levels "urban location",..: 1 1 1 1 1 2 2 2 1 1 ...
## $ region : Factor w/ 9 levels "Sofia city","Bourgass",..: 8 8 8 9 9 4 4 4 8 8 ...
## $ ethnicity.head : Factor w/ 4 levels "Bulgaria","Turks",..: 2 2 2 1 1 1 1 1 1 1 ...
## $ age : num 37 11 8 73 70 75 79 80 82 83 ...
## $ gender : Factor w/ 2 levels "male","female": 2 2 1 1 2 1 1 2 2 2 ...
## $ relation : Factor w/ 9 levels "head ",..: 1 3 3 1 2 1 1 2 1 1 ...
## $ literate : Factor w/ 2 levels "no","yes": 1 2 2 2 2 2 2 2 2 2 ...
## $ income.mnt : num 13.3 13.3 13.3 142.5 142.5 ...
## $ income : num 160 160 160 1710 1710 ...
## $ aggregate : num 1042 1042 1042 3271 3271 ...
## $ aggr.ind.annual : num 347 347 347 1635 1635 ...
## $ educ.completed : int 2 4 4 4 3 3 3 3 4 4 ...
## $ grade.complete : num 4 3 0 3 4 4 4 4 5 5 ...
## $ grade.all : num 4 11 8 11 8 8 8 8 13 13 ...
## $ unemployed : int 2 1 1 1 1 1 1 1 1 1 ...
## $ reason.OLF : int NA NA NA 3 3 3 9 9 3 3 ...
## $ sector : int NA NA NA NA NA NA 1 1 NA NA ...
## $ occupation : int NA NA NA NA NA NA 5 5 NA NA ...
## $ earn.mont : num 0 0 0 0 0 0 20 20 0 0 ...
## $ earn.ann : num 0 0 0 0 0 0 240 240 0 0 ...
## $ hours.week : num NA NA NA NA NA NA 30 35 NA NA ...
## $ hours.mnt : num NA NA NA NA NA ...
## $ fulltime : int NA NA NA NA NA NA 1 1 NA NA ...
## $ hhexp : num 100 100 100 343 343 ...
## $ legacy.pension.amt: num NA NA NA NA NA NA NA NA NA NA ...
## - attr(*, "datalabel")= chr ""
## - attr(*, "time.stamp")= chr ""
## - attr(*, "formats")= chr "%9.0g" "%9.0g" "%9.0g" "%9.0g" ...
## - attr(*, "types")= int 100 100 108 108 108 100 108 108 108 100 ...
## - attr(*, "val.labels")= chr "" "" "location" "region" ...
## - attr(*, "var.labels")= chr "hhid" "hhweight" "location" "region" ...
## - attr(*, "expansion.fields")=List of 12
## ..$ : chr "_dta" "_svy_su1" "cluster"
## ..$ : chr "_dta" "_svy_strata1" "strata"
## ..$ : chr "_dta" "_svy_stages" "1"
## ..$ : chr "_dta" "_svy_version" "2"
## ..$ : chr "_dta" "__XijVarLabcons" "(sum) cons"
## ..$ : chr "_dta" "ReS_Xij" "cons"
## ..$ : chr "_dta" "ReS_str" "0"
## ..$ : chr "_dta" "ReS_j" "group"
## ..$ : chr "_dta" "ReS_ver" "v.2"
## ..$ : chr "_dta" "ReS_i" "hhid dur"
## ..$ : chr "_dta" "note1" "variables g1pc, g2pc, g3pc, g4pc, g5pc, g7pc, g8pc, g9pc, g10pc, g11pc, g12pc, gall, health, rent, durables we"| __truncated__
## ..$ : chr "_dta" "note0" "1"
## - attr(*, "version")= int 7
## - attr(*, "label.table")=List of 12
## ..$ location: Named int 1 2
## .. ..- attr(*, "names")= chr "urban location" "rural location"
## ..$ region : Named int 1 2 3 4 5 6 7 8 9
## .. ..- attr(*, "names")= chr "Sofia city" "Bourgass" "Varna" "Lovetch" ...
## ..$ ethnic : Named int 1 2 3 4
## .. ..- attr(*, "names")= chr "Bulgaria" "Turks" "Roma" "Other"
## ..$ s2_q2 : Named int 1 2
## .. ..- attr(*, "names")= chr "male" "female"
## ..$ s2_q3 : Named int 1 2 3 4 5 6 7 8 9
## .. ..- attr(*, "names")= chr "head " "spouse/partner " "child " "son/daughter-in-law " ...
## ..$ lit : Named int 1 2
## .. ..- attr(*, "names")= chr "no" "yes"
## ..$ : Named int 1 2 3 4
## .. ..- attr(*, "names")= chr "never attanded" "primary" "secondary" "postsecondary"
## ..$ : Named int 1 2
## .. ..- attr(*, "names")= chr "Not unemployed" "Unemployed"
## ..$ : Named int 1 2 3 4 5 6 7 8 9 10
## .. ..- attr(*, "names")= chr "student" "housewife/childcare" "in retirement" "illness, disability" ...
## ..$ : Named int 1 2 3 4 5 6 7 8 9 10
## .. ..- attr(*, "names")= chr "agriculture" "mining" "manufacturing" "utilities" ...
## ..$ : Named int 1 2 3 4 5
## .. ..- attr(*, "names")= chr "private company" "public works program" "government,public sector, army" "private individual" ...
## ..$ : Named int 1 2
## .. ..- attr(*, "names")= chr "no" "yes"
Can you tell the difference between the different versions of read.dta()? For this data, the first version (where you simply specified the file path) will be most useful to work with.
Do you know your data?
The previous exercise dealt with socio-economic indicators and access to education for different individuals. The edu_equal_1 dataset that you’ve built is already available in the workspace. Now that you have it in R, it’s pretty easy to get some basic insights.
For example, you can ask how many observations (that is, how many people) are older than 40 and literate. When you call
str(edu_equal_1)
you’ll see that age is numeric, and literate is a factor with the levels “no” and “yes”. The following expression thus answers the question:
nrow(subset(edu_equal_1, age > 40 & literate == "yes"))
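The same counting pattern can be checked on a few invented rows. In the sketch below the data frame people is hypothetical; it also shows that subset() drops rows where the condition evaluates to NA, which matches summing the logical vector with na.rm = TRUE.

```r
# Hypothetical toy data; values are invented for illustration
people <- data.frame(
  age      = c(37, 73, 70, NA, 82),
  literate = factor(c("no", "yes", "yes", "yes", "no"), levels = c("no", "yes"))
)

# Count rows that satisfy both conditions
n_subset <- nrow(subset(people, age > 40 & literate == "yes"))

# Equivalent count from the logical vector; NA results are ignored
n_sum <- sum(people$age > 40 & people$literate == "yes", na.rm = TRUE)

n_subset
n_sum
```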
Up to you to answer a similar question now:
How many observations/individuals from Bulgaria have an income above 1000?
nrow(subset(edu_equal_1, ethnicity_head == "Bulgaria" & income > 1000))
## [1] 8997
The data frame contains 8997 individuals with a Bulgarian ethnicity whose income is above 1000.
Import SPSS data with foreign (1)
All great things come in pairs. Where foreign provided read.dta() to read Stata data, there’s also read.spss() to read SPSS data files. To get a data frame, make sure to set to.data.frame = TRUE inside read.spss().
In this exercise, you’ll be working with socio-economic variables from different countries (Source: Quantitative Data Analysis in Education). The SPSS data is in a file called international.sav, which is in your working directory. You can also download it here if you want to play around with it some more.
# Import international.sav as a data frame: demo
demo <- read.spss("../xDatasets/international.sav", to.data.frame = TRUE)
## re-encoding from CP1252
# Create boxplot of gdp variable of demo
boxplot(demo$gdp)
Correlation
If you’re familiar with statistics, you’ll have heard of Pearson’s correlation. It measures the linear dependency between two variables, say X and Y, and ranges from -1 to 1. A value close to 1 indicates a strong positive association: if X is high, Y tends to be high as well. A value close to -1 indicates a strong negative association: if X is high, Y tends to be low. A Pearson correlation of 0 means there is no linear association between X and Y; the variables may be independent, but a nonlinear relationship is still possible.
We can calculate the correlation between two vectors with the cor() function. For example, the following code computes the correlation between the columns height and width of a fictional data frame size:
cor(size$height, size$width)
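To make this concrete, the sketch below fills the fictional size data frame with invented numbers and checks that cor() matches the definition of Pearson's r (covariance divided by the product of the standard deviations). It also shows a case where the correlation is 0 although one variable is fully determined by the other.

```r
# Hypothetical `size` data frame; the numbers are invented
size <- data.frame(height = c(150, 160, 170, 180, 190),
                   width  = c(50, 54, 61, 64, 72))

# Pearson's r: covariance divided by the product of the standard deviations
r_manual  <- cov(size$height, size$width) / (sd(size$height) * sd(size$width))
r_builtin <- cor(size$height, size$width)
all.equal(r_manual, r_builtin)

# Zero correlation is not the same as independence: here y is exactly x squared
x <- c(-2, -1, 0, 1, 2)
y <- x^2
r_zero <- cor(x, y)
r_zero
```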
What is the correlation coefficient for the two numerical variables gdp and f_illit (female illiteracy rate)?
cor(demo$gdp, demo$f_illit)
## [1] -0.4476856
The value of the correlation coefficient is equal to -0.4476856. That indicates a negative association between GDP and female illiteracy.
Import SPSS data with foreign (2)
In the previous exercise, you used the to.data.frame argument inside read.spss(). There are many other ways in which to customize the way your SPSS data is imported.
In this exercise you will experiment with another argument, use.value.labels. It specifies whether variables with value labels should be converted into R factors with levels that are named accordingly. The argument is TRUE by default, which means that so-called labelled variables inside SPSS are converted to factors inside R.
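Conceptually, turning a labelled variable into a factor means mapping each numeric code to its label. The base R sketch below illustrates that idea only; it is not how read.spss() is implemented, and the mapping (1 = Africa, 2 = Americas) is an assumption inferred from the contint column in the tables printed in this section.

```r
# Numeric codes as they would appear with use.value.labels = FALSE
contint_codes <- c(2, 1, 1, 2, 2, 2)

# Assumed value labels (code 1 = Africa, code 2 = Americas)
value_labels <- c(Africa = 1, Americas = 2)

# With use.value.labels = TRUE the codes would be presented as a factor instead
contint_factor <- factor(contint_codes,
                         levels = value_labels,
                         labels = names(value_labels))
contint_factor
```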
# Import international.sav as demo_1
demo_1 <- read.spss("../xDatasets/international.sav", to.data.frame = TRUE)
## re-encoding from CP1252
# Print out the head of demo_1
demo_1 %>%
  head() %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE, position = "left", font_size = 11) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#3f7689")
| id | country | contint | m_illit | f_illit | lifeexpt | gdp |
|---|---|---|---|---|---|---|
| 1 | Argentina | Americas | 3.0 | 3.0 | 16 | 3375 |
| 2 | Benin | Africa | 45.2 | 74.5 | 7 | 521 |
| 3 | Burundi | Africa | 33.2 | 48.1 | 5 | 86 |
| 4 | Chile | Americas | 4.2 | 4.4 | 14 | 4523 |
| 5 | Dominican Republic | Americas | 12.0 | 12.7 | 12 | 2408 |
| 6 | El Salvador | Americas | 17.6 | 22.9 | 11 | 2302 |
# Import international.sav as demo_2
demo_2 <- read.spss("../xDatasets/international.sav", to.data.frame = TRUE, use.value.labels = FALSE)
## re-encoding from CP1252
# Print out the head of demo_2
demo_2 %>%
  head() %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE, position = "left", font_size = 11) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#3f7689")
| id | country | contint | m_illit | f_illit | lifeexpt | gdp |
|---|---|---|---|---|---|---|
| 1 | Argentina | 2 | 3.0 | 3.0 | 16 | 3375 |
| 2 | Benin | 1 | 45.2 | 74.5 | 7 | 521 |
| 3 | Burundi | 1 | 33.2 | 48.1 | 5 | 86 |
| 4 | Chile | 2 | 4.2 | 4.4 | 14 | 4523 |
| 5 | Dominican Republic | 2 | 12.0 | 12.7 | 12 | 2408 |
| 6 | El Salvador | 2 | 17.6 | 22.9 | 11 | 2302 |